Some Issues in Automatic Genre Classification of Web Pages
Abstract
In this paper, two experiments in automatic genre classification of web pages are presented. These two experiments are designed to highlight three important issues related to genre classification: corpus composition and genre palettes, feature representativeness, and exportability of classification models. Results show the influence of corpus composition and genre palette on classification rates. They also show how well and to what extent feature sets represent genres in a palette, and give an idea of the limitations of the classification models when exported and used for predictive tasks.
Similar resources
Performance Improvement of Web Page Genre Classification
Given the dynamic nature of the web and the growing number of web pages, it is very difficult to find the required pages easily and quickly among the thousands retrieved by a search engine. The solution to this problem is to classify web pages according to their genre. Automatic genre identification of web pages has become an important area in web page classification, because...
Semi-supervised Graph-based Genre Classification for Web Pages
It is still unclear which set of features produces the best results in automatic genre classification on the web. Therefore, in the first set of experiments, we compared a wide range of content-based features extracted from the data appearing within the web pages. The results show that lexical features such as word unigrams and character n-grams have more discriminative power...
Cybergenre: Automatic Identification of Home Pages on the Web
The research reported in this paper is part of a larger project on the automatic classification of web pages by genre. The long-term goal is to incorporate web page genre into the search process to improve the quality of search results. In this phase, a neural net classifier was trained to distinguish home pages from non-home pages and to classify those home pages as personal h...
An n-gram Based Approach to the Classification of Web Pages by Genre
The extraordinary growth in both the size and popularity of the World Wide Web has created a growing interest not only in identifying Web page genres, but also in using these genres to classify Web pages. The hypothesis of this research is that an n-gram representation of a Web page can be used effectively to automatically classify that Web page by genre. This research involves the development ...
Training a Genre Classifier for Automatic Classification of Web Pages
This paper presents experiments on classifying web pages by genre. Firstly, a corpus of 1,539 manually labeled web pages was prepared. Secondly, 502 genre features were selected based on the literature and the observation of the corpus. Thirdly, these features were extracted from the corpus to obtain a data set. Finally, two machine learning algorithms, one for induction of decision trees (J48) a...